Red Wine Quality Analysis by Jorge Riera

The following study explores various characteristics of red wine including pH, citric acid, and residual sugars to determine which factors could be used to predict quality. These characteristics are compared to the ratings of three wine experts. The relationships between these variables are explored throughout this study.

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

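This dataset consists of 1599 observations of 13 variables. The first variable, X, is a row index, which leaves 11 physicochemical measurements plus the quality rating.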

##   75% 
## 12.35
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Most wines have a fixed acidity between 6 and 11. The distribution is slightly right-skewed. Wines with a fixed acidity higher than 12.35 are outliers, and there are several of them.
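The 12.35 cutoff is consistent with Tukey's rule, which flags values more than 1.5 interquartile ranges above the third quartile. The report does not name the rule, so this is an inference from the printed numbers. A minimal sketch that reproduces the cutoff from the quartiles in the summary (the helper name `upper_fence` is hypothetical):

```r
# Quartiles copied from the fixed.acidity summary printed above.
q1 <- 7.10
q3 <- 9.20

# Tukey's rule: values above Q3 + 1.5 * IQR are treated as outliers.
cutoff <- q3 + 1.5 * (q3 - q1)
print(cutoff)

# Hypothetical helper for a raw numeric vector. quantile() returns a value
# named "75%", which would explain the label on the printed cutoffs.
upper_fence <- function(x) quantile(x, 0.75) + 1.5 * IQR(x)
```

The same arithmetic applied to the other summaries gives the other upper cutoffs in this section, for example 0.64 + 1.5 * (0.64 - 0.39) = 1.015 for volatile acidity.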

##   75% 
## 1.015
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The lowest volatile acidity in the dataset is 0.12 and the highest is 1.58. The plot above shows the main body of the distribution, with the highest values trimmed. The distribution appears to be bimodal. Wines with a volatile acidity higher than 1.015 are outliers, and there are several of them.

##   75% 
## 0.915
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

I transformed the long-tailed data to better understand the distribution of citric acid. Citric acid levels range from 0 to 1, and 132 wines contain none at all. The middle half of the wines have values between 0.09 and 0.42. The wine with a citric acid level of 1 is the only outlier in this distribution.

##  75% 
## 3.65
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The middle half of the wines have a residual sugar level between 1.9 and 2.6. I plotted the values that lie in this range, and they appear to be normally distributed. Residual sugar ranges from 0.9 to 15.5. Wines with residual sugar levels higher than 3.65 are outliers, and there are several of them.

##  75% 
## 0.12
##  25% 
## 0.04
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The chloride levels for the wines in the dataset range from 0.012 to 0.611. The middle half of the values lie between 0.07 and 0.09. I plotted the chloride levels for wines within this range, and they appear to be normally distributed. Wines with chloride levels higher than 0.12 are outliers, as are a handful of wines at the lower end with chloride levels below 0.04.

## 75% 
##  42
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## free.sulfur.dioxide
##    1    2    3    4    5  5.5    6    7    8    9   10   11   12   13   14 
##    3    1   49   41  104    1  138   71   56   62   79   59   75   57   50 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   78   61   60   46   39   30   41   22   32   34   24   32   29   23   23 
##   30   31   32   33   34   35   36   37 37.5   38   39   40 40.5   41   42 
##   16   20   22   11   18   15   11    3    2    9    5    6    1    7    3 
##   43   45   46   47   48   50   51   52   53   54   55   57   66   68   72 
##    3    3    1    1    4    2    4    3    1    1    2    1    1    2    1

Transforming the plot of free sulfur dioxide reveals a bimodal distribution. The middle half of the wines have values between 7 and 21. The lowest value is 1 and the highest is 72. Wines with free sulfur dioxide levels higher than 42 are outliers, and there are several of them.

## 75% 
## 122
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## total.sulfur.dioxide
##    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
##    3    4   14   14   27   26   29   28   33   35   26   27   35   29   33 
##   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35 
##   25   25   34   36   27   24   30   43   20   14   32   20   17   20   26 
##   36   37   38   39   40   41   42   43   44   45   46   47   48   49   50 
##   12   26   31   16   17   14   26   18   23   20   17   24   21   21   11 
##   51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
##   11   15   14   20   13   10    6   14    9   18    9    9   13   10   17 
##   66   67   68   69   70   71   72   73   74   75   76   77 77.5   78   79 
##    9   12   10    8    8    7   10    7    8    5    3    8    2    4    5 
##   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
##    4    6    4    2    6    9   10    6   14    9    5    7    8    2    8 
##   95   96   98   99  100  101  102  103  104  105  106  108  109  110  111 
##    4    5    7    6    3    4    6    2    5    5    6    3    4    6    3 
##  112  113  114  115  116  119  120  121  122  124  125  126  127  128  129 
##    3    4    2    2    1    7    2    4    3    3    2    1    2    2    3 
##  130  131  133  134  135  136  139  140  141  142  143  144  145  147  148 
##    1    3    3    2    2    2    1    1    3    1    2    3    3    3    2 
##  149  151  152  153  155  160  165  278  289 
##    1    2    1    1    1    1    1    1    1

Total sulfur dioxide levels range from 6 to 289. The middle half of the values lie between 22 and 62. Wines with total sulfur dioxide levels higher than 122 are outliers, and there are several of them.

##      75% 
## 1.001187
##       25% 
## 0.9922475

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The densities of the wines in the dataset range from 0.9901 to 1.004. The middle half of the wines lie between 0.9956 and 0.9978. Wines with densities higher than 1.001187 or lower than 0.9922475 are outliers.

##   75% 
## 3.685
##   25% 
## 2.925
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## pH
## 2.74 2.86 2.87 2.88 2.89  2.9 2.92 2.93 2.94 2.95 2.98 2.99    3 3.01 3.02 
##    1    1    1    2    4    1    4    3    4    1    5    2    6    5    8 
## 3.03 3.04 3.05 3.06 3.07 3.08 3.09  3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 
##    6   10    8   10   11   11   11   19    9   20   13   21   34   36   27 
## 3.18 3.19  3.2 3.21 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29  3.3 3.31 3.32 
##   30   25   39   36   39   32   29   26   53   35   42   46   57   39   45 
## 3.33 3.34 3.35 3.36 3.37 3.38 3.39  3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 
##   37   43   39   56   37   48   48   37   34   33   17   29   20   22   21 
## 3.48 3.49  3.5 3.51 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59  3.6 3.61 3.62 
##   19   10   14   15   18   17   16    8   11   10   10    8    7    8    4 
## 3.63 3.66 3.67 3.68 3.69  3.7 3.71 3.72 3.74 3.75 3.78 3.85  3.9 4.01 
##    3    4    3    5    4    1    4    3    1    1    2    1    2    2

The pH levels of the wines range from 2.74 to 4.01. The middle half have values between 3.21 and 3.4. Wines with pH levels higher than 3.685 or lower than 2.925 are outliers.

## 75% 
##   1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## sulphates
## 0.33 0.37 0.39  0.4 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 
##    1    2    6    4    5    8   16   12   18   19   29   31   27   26   47 
## 0.53 0.54 0.55 0.56 0.57 0.58 0.59  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 
##   51   68   50   60   55   68   51   69   45   61   48   46   41   42   36 
## 0.68 0.69  0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 
##   35   23   33   26   28   26   26   20   25   26   23   18   19   15   22 
## 0.83 0.84 0.85 0.86 0.87 0.88 0.89  0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 
##   15   13   14   13   13    7    7    8    8    5   10    4    2    3    6 
## 0.98 0.99    1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09  1.1 1.11 1.12 
##    2    3    1    1    3    2    2    3    4    2    3    1    2    1    1 
## 1.13 1.14 1.15 1.16 1.17 1.18  1.2 1.22 1.26 1.28 1.31 1.33 1.34 1.36 1.56 
##    2    2    1    1    5    3    1    1    1    2    1    1    1    3    1 
## 1.59 1.61 1.62 1.95 1.98    2 
##    1    1    1    2    1    1

Sulphate levels range from 0.33 to 2.00. The middle half of the values lie between 0.55 and 0.73. The distribution is right-skewed. Wines with sulphate levels higher than 1 are outliers, and there are several of them.

##  75% 
## 13.5

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The middle half of the wines have alcohol levels between 9.5 and 11.1. The minimum is 8.4 and the maximum is 14.9. The distribution is right-skewed. Wines with alcohol levels higher than 13.5 are outliers, and there are several of them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## quality
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The quality ratings range from 3 to 8. Most wines have a rating of 5 or 6.
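To put a number on "most", the counts in the quality table can be turned into shares. This is a small sketch that uses only the printed counts; the variable names are illustrative and not taken from the original analysis.

```r
# Counts copied from the quality table printed above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681,
                    "6" = 638, "7" = 199, "8" = 18)

# Share of wines at each rating, in percent.
print(round(100 * quality_counts / sum(quality_counts), 1))

# Combined share of wines rated 5 or 6.
share_5_or_6 <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
print(round(100 * share_5_or_6, 1))
```

Ratings of 5 and 6 account for 1319 of the 1599 wines, so the extremes of the scale are thinly represented.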

Univariate Analysis

Structure of dataset

The dataset contains 1599 wines, each described by 11 independent variables based on physicochemical tests and one dependent variable. The dependent variable is quality, which is based on sensory data provided by wine experts.

Main features of interest

The univariate analysis of the red wine dataset revealed interesting distributions in alcohol, fixed acidity, and sulphates. Each of these variables has a long tail, which I will investigate further in my bivariate analysis. I suspect that these variables have an influence on the quality of the wine.

Other features of interest

The other two variables that I will examine in more detail are citric acid and total sulfur dioxide. Citric acid is added to finished wines to increase acidity and give the wine a ‘fresh flavor.’ If too much is added, it can increase the formation of volatile acid, which gives wines a vinegar taste. There may be a ‘sweet spot’ for this variable, which I will explore in more detail. Plotting the values of total sulfur dioxide revealed a bimodal distribution, which may be related to the quality of the wine.

Grouping Wines Based on Quality

I added a variable called ‘binary.quality’ that divides the wines into two categories: good or bad. Wines with a quality rating of 6 or higher are deemed good. Any wine with a rating below 6 is considered bad. Creating these two groups will help to uncover patterns in the bivariate and multivariate analyses.
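The grouping rule can be sketched as follows. Because this report prints only the counts per rating, the sketch rebuilds a quality vector from those counts rather than reading the data; on the full data frame the same `ifelse()` call would presumably be assigned to `wine$binary.quality`. The level names "Bad" and "Good" follow the labels shown in the grouped summaries.

```r
# Rebuild the quality ratings (3 to 8) from the counts printed in the
# univariate section: 10, 53, 681, 638, 199, and 18 wines respectively.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Wines rated 6 or higher are "Good"; everything else is "Bad".
binary_quality <- factor(ifelse(quality >= 6, "Good", "Bad"),
                         levels = c("Bad", "Good"))

# Size of each group.
group_sizes <- table(binary_quality)
print(group_sizes)
```

With this cutoff the two groups are of similar size (744 bad and 855 good), which keeps the comparisons in the following sections reasonably balanced.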

Bivariate Plots Section

Pair Plot

A pair-plot of the variables in the dataset reveals that alcohol, volatile acidity, and citric acid seem to have a strong relationship with the quality of a wine. I will explore these relationships further.

## wine$binary.quality: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1800  0.4600  0.5900  0.5895  0.6800  1.5800 
## -------------------------------------------------------- 
## wine$binary.quality: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3500  0.4600  0.4741  0.5800  1.0400

The boxplot above shows that the variance of volatile acidity in good and bad wines is similar. Bad wines have a higher mean and median. The mean and median volatile acidity for good wines were 0.4741 and 0.46 respectively. The mean and median volatile acidity for bad wines were 0.5895 and 0.59 respectively. A frequency polygon of volatile acidity by wine quality shows that the higher the volatile acidity, the more likely a wine is to be of bad quality. Most wines with a volatile acidity above 0.6 were of bad quality.

## wine$binary.quality: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5200  0.5800  0.6185  0.6500  2.0000 
## -------------------------------------------------------- 
## wine$binary.quality: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.5900  0.6600  0.6926  0.7700  1.9500

Sulphate levels also vary between wines of different qualities. Good quality wines had higher mean and median sulphate levels. The mean and median values for good wines were 0.6926 and 0.66 respectively. Wines of poor quality had mean and median sulphate levels of 0.6185 and 0.58 respectively. A frequency polygon of sulphate levels by wine quality reveals that those with higher levels have a greater likelihood of being good wines. Wines that had sulphate levels of 0.63 or higher tend to be of good quality.

## wine$binary.quality: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.400   9.700   9.926  10.300  14.900 
## -------------------------------------------------------- 
## wine$binary.quality: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40   10.00   10.80   10.86   11.70   14.00

Alcohol content varies depending on the quality of the wine. Wines of good quality have a higher variance as well as a higher mean and median. The mean and median alcohol content for good wines were 10.86 and 10.8 respectively. Bad wines had a mean alcohol content of 9.926 and a median of 9.7. A frequency polygon of alcohol content by wine quality shows that wines with alcohol levels above 10.25 are more likely to be of good quality.

## wine$binary.quality: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.75   45.00   54.65   78.00  155.00 
## -------------------------------------------------------- 
## wine$binary.quality: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   20.00   33.00   39.35   50.00  289.00

Wines of poor quality have a larger variance in total sulfur dioxide levels as well as a higher mean and median. Wines of bad quality have a median of 45 and a mean of 54.65. Wines of good quality have a mean of 39.35 and a median of 33. A plot of the frequency of total sulfur dioxide levels by wine quality suggests that wines with levels above 80 are more likely to be of bad quality.

## wine$binary.quality: Bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0800  0.2200  0.2378  0.3600  1.0000 
## -------------------------------------------------------- 
## wine$binary.quality: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1150  0.3100  0.2999  0.4600  0.7800

A box plot of the citric acid content of wines by quality reveals that good wines have a higher median and variance than bad wines. Good wines also have a higher mean citric acid level. The mean and median levels for good wines are 0.2999 and 0.31 respectively. Bad wines have a mean of 0.2378 and a median of 0.22. Plotting the frequency of citric acid levels by wine quality suggests that good wines have higher citric acid levels. One notable aspect of this data is that it does not show citric acid harming quality at high levels. Because citric acid can increase the formation of volatile acid, one would expect a resurgence of bad wines at higher levels.

Bivariate Analysis

Relationships between features

## [1] -0.06166827

A bivariate analysis of the dataset reveals that good wines tend to have a higher alcohol content as well as lower levels of volatile acidity. These two variables do not seem to have a significant correlation with each other, but I will explore this relationship in more detail in my multivariate analysis.

Other Observations

One interesting relationship that wasn’t a main feature of interest is that of fixed acidity and density. These two variables seem to have a significant correlation. This will be explored further.

Strongest Relationship

## [1] -0.6829782

Fixed acidity and pH have the strongest correlation among the variables analyzed in this dataset, with a correlation coefficient of -0.683. This makes sense because pH measures how acidic or basic a substance is, and a lower pH means a more acidic wine. One would therefore expect a strong negative correlation between these two variables.

Multivariate Plots Section

## [1] 0.6680473

In the previous section, I noted that density and fixed acidity seem to have a significant correlation. I examined this relationship further by creating a scatter plot of the two variables and labeling each point based on the quality of the underlying wine. It seems like at lower fixed acidity levels, good wines tend to have lower densities.

It seems like the higher the alcohol content and lower the volatile acidity of a wine, the more likely it is to be of good quality.

Higher sulphate levels and alcohol content also seem to be indicators of a good wine.

Similarly, wines with higher alcohol and citric acid levels tend to be of good quality.

Multivariate Analysis

Wines with higher alcohol and sulphate levels and lower volatile acidity tend to be of better quality. One could use these features to construct a linear model to predict the quality of a red wine.

Building a Model Using the Red Wine Dataset

## 
## Calls:
## m1: lm(formula = I(quality ~ alcohol), data = wine)
## m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wine)
## m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
##     data = wine)
## 
## =====================================================
##                        m1         m2         m3      
## -----------------------------------------------------
##   (Intercept)        1.875***   3.095***   2.611***  
##                     (0.175)    (0.184)    (0.196)    
##   alcohol            0.361***   0.314***   0.309***  
##                     (0.017)    (0.016)    (0.016)    
##   volatile.acidity             -1.384***  -1.221***  
##                                (0.095)    (0.097)    
##   sulphates                                0.679***  
##                                           (0.101)    
## -----------------------------------------------------
##   R-squared              0.2        0.3        0.3   
##   adj. R-squared         0.2        0.3        0.3   
##   sigma                  0.7        0.7        0.7   
##   F                    468.3      370.4      268.9   
##   p                      0.0        0.0        0.0   
##   Log-likelihood     -1721.1    -1621.8    -1599.4   
##   Deviance             805.9      711.8      692.1   
##   AIC                 3448.1     3251.6     3208.8   
##   BIC                 3464.2     3273.1     3235.7   
##   N                   1599       1599       1599     
## =====================================================
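The table rounds R-squared to one decimal, which hides the difference between m2 and m3. For a least-squares model with an intercept, F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), where k is the number of predictors, so R-squared can be recovered approximately from the F row. The sketch below assumes that relationship applies to these fits and uses the rounded F values from the table, so the results are approximate.

```r
n <- 1599                                        # observations (row N)
k <- c(m1 = 1, m2 = 2, m3 = 3)                   # predictors in each model
f_stat <- c(m1 = 468.3, m2 = 370.4, m3 = 268.9)  # row F

# Invert F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) to solve for R^2.
r_squared <- (f_stat * k) / (f_stat * k + (n - k - 1))
print(round(r_squared, 3))
```

Under these assumptions the three models explain roughly 23%, 32%, and 34% of the variance in quality, so each added predictor helps, but by a shrinking margin.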

Predictions

Example red wine:

Alcohol: 13, Volatile Acidity: 0.32, Sulphates: 0.85, Quality: 7

##        fit      lwr      upr
## 1 6.816987 5.521736 8.112239

It appears that a model that includes alcohol, volatile acidity, and sulphates can be used to predict the quality of a wine, at least roughly. The wine used to test this model had a quality rating of 7, and the model predicted a rating of 6.816987, with an interval from 5.52 to 8.11. A single example is not a measure of accuracy, however: the R-squared of about 0.3 in the table above means the model explains only around 30% of the variation in quality. Another limitation is that linear regression is sensitive to outliers. For example, there may be cases where a good wine has a relatively low alcohol content, which would affect the accuracy of this model.
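The fitted value can be checked by hand from the m3 coefficients in the regression table. This is a sketch using the rounded coefficients, so it matches the reported fit only to about two decimal places; the variable names are illustrative.

```r
# Coefficients of m3 copied from the regression table (rounded to 3 decimals).
coefs <- c(intercept = 2.611, alcohol = 0.309,
           volatile.acidity = -1.221, sulphates = 0.679)

# The example wine from the Predictions section.
example_wine <- c(alcohol = 13, volatile.acidity = 0.32, sulphates = 0.85)

# Linear predictor: intercept plus the sum of coefficient * value.
fit <- coefs[["intercept"]] + sum(coefs[names(example_wine)] * example_wine)
print(fit)
```

The fit/lwr/upr layout of the printed result matches what R's `predict()` returns for an `lm` object when an interval is requested; which interval type was used is not stated in the report.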


Final Plots and Summary

Volatile Acidity and Wine Quality

Wines with a volatile acidity below 0.6 tend to be of good quality. Wines with a volatile acidity above 0.6 were more likely to be of bad quality.

Alcohol Content and Wine Quality

Most wines with an alcohol content greater than 10% are of good quality. There are significantly more wines with alcohol contents between 9 and 10% that are of bad quality.

Volatile Acidity, Alcohol Content, and Wine Quality

Wines with higher alcohol contents and lower volatile acidity levels were more likely to be of good quality. Most wines with alcohol contents greater than 10% and volatile acidity levels lower than 0.6 (g / dm^3) are of good quality.


Reflection

Exploring the wine dataset revealed a relationship between the volatile acidity, alcohol content, and quality of a wine. A univariate analysis of these variables revealed skewed distributions, which seem to be a function of the underlying quality of the wines. Prior to conducting the bivariate and multivariate analyses, the wines were divided into two groups (good/bad). Encoding the wines with a binary variable helped to discern which features had the most influence over the quality of wines. The higher the alcohol content of a wine, the more likely it is to be of good quality. Good quality wines also have lower volatile acidity levels. A multivariate analysis of these features suggested that the quality of a wine could be predicted from its volatile acidity and alcohol levels. A linear regression model was constructed to predict the quality of a wine based on its alcohol content, sulphates, and volatile acidity. Further work may be done to assess the accuracy of this model. A larger dataset may reveal other findings that were not included in this study.

Citations

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.